Are Greedy Task Orderings Better Than Random in Continual Linear Regression?
Matan Tsipory, Ran Levinstein, Itay Evron, Mark Kong, Deanna Needell, Daniel Soudry
We analyze task orderings in continual learning for linear regression, assuming joint realizability of training data. We focus on orderings that greedily maximize dissimilarity between consecutive tasks, a concept briefly explored in prior work but still surrounded by open questions. Using tools from the Kaczmarz method literature, we formalize such orderings and develop geometric and algebraic intuitions around them. Empirically, we demonstrate that greedy orderings converge faster than random ones in terms of the average loss across tasks, both for linear regression with random data and for linear probing on CIFAR-100 classification tasks. Analytically, in a high-rank regression setting, we prove a loss bound for greedy orderings analogous to that of random ones. However, under general rank, we establish a repetition-dependent separation. Specifically, while prior work showed that for random orderings, with or without replacement, the average loss after $k$ iterations is bounded by $\mathcal{O}(1/\sqrt{k})$, we prove that single-pass greedy orderings may fail catastrophically, whereas those allowing repetition converge at rate $\mathcal{O}(1/\sqrt[3]{k})$. Overall, we reveal nuances within and between greedy and random orderings.
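As a toy illustration of the setting described above (our own minimal sketch, not code from the paper): in jointly realizable continual linear regression, fitting one task to convergence amounts to projecting the current iterate onto that task's solution set, exactly as in the block Kaczmarz method. A greedy ordering with repetition can then pick, at each step, the task whose residual at the current iterate is largest (one natural notion of "most dissimilar"); the comparison against a uniformly random ordering is likewise simplified here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tasks = 20, 10

# Jointly realizable tasks: all systems A_i x = b_i share a solution x_star.
x_star = rng.normal(size=d)
tasks = []
for _ in range(n_tasks):
    A = rng.normal(size=(3, d))          # each task is a small linear system
    tasks.append((A, A @ x_star))

def project(x, A, b):
    """Project x onto the solution set of A x = b (i.e. fit the task exactly)."""
    return x - A.T @ np.linalg.pinv(A @ A.T) @ (A @ x - b)

def run(order_fn, k=50):
    """Train on k tasks in the given order; return the average loss across tasks."""
    x = np.zeros(d)
    for _ in range(k):
        A, b = tasks[order_fn(x)]
        x = project(x, A, b)
    return np.mean([np.mean((A @ x - b) ** 2) for A, b in tasks])

# Greedy-with-repetition: pick the task with the largest current residual.
greedy = lambda x: max(range(n_tasks),
                       key=lambda i: np.linalg.norm(tasks[i][0] @ x - tasks[i][1]))
random_ = lambda x: rng.integers(n_tasks)

print(run(greedy), run(random_))
```

In this toy, both orderings drive the average loss toward zero since the tasks share a solution; the paper's empirical finding concerns how fast each does so.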
Memorizing Long-tail Data Can Help Generalization Through Composition
Mo Zhou, Haoyang Ma, Rong Ge
The relationship between memorization and generalization has always been an intriguing topic in deep learning. It has long been known that neural networks used for supervised learning can memorize noisy or even random labels (Zhang et al., 2017), and recent large language models can still memorize long text despite not being overparametrized (Carlini et al., 2019, 2022). Conventional wisdom from statistical learning theory suggests that memorization might be detrimental to generalization performance, yet neural networks often generalize well with memorization, sometimes even better compared to networks that do not memorize. Trying to understand this relationship from a theoretical perspective has led to the idea of implicit regularization and benign overfitting (Belkin et al., 2019; Bartlett et al., 2020; Belkin et al., 2020; Hastie et al., 2022; Gunasekar et al., 2017), where the training process and architecture choices of neural networks prefer certain solutions that have good generalization despite memorizing the training data. Another interesting line of work (Feldman, 2020; Feldman and Zhang, 2020) demonstrated that memorization can actively help generalization by capturing long-tail behaviors in the training data.
PAC Learnability in the Presence of Performativity
Ivan Kirev, Lyuben Baltadzhiev, Nikola Konstantinov
Following the widespread adoption of machine learning models in real-world applications, the phenomenon of performativity, i.e. model-dependent shifts in the test distribution, becomes increasingly prevalent. Unfortunately, since models are usually trained solely based on samples from the original (unshifted) distribution, this performative shift may lead to decreased test-time performance. In this paper, we study the question of whether and when performative binary classification problems are learnable, through the lens of the classic PAC (Probably Approximately Correct) learning framework. We motivate several performative scenarios, accounting in particular for linear shifts in the label distribution, as well as for more general changes in both the labels and the features. We construct a performative empirical risk function, which depends only on data from the original distribution and on the type of performative effect, yet is an unbiased estimate of the true risk of a classifier on the shifted distribution. Minimizing this notion of performative risk allows us to show that any PAC-learnable hypothesis space in the standard binary classification setting remains PAC-learnable for the considered performative scenarios. We also conduct an extensive experimental evaluation of our performative risk minimization method and showcase benefits on synthetic and real data.
Improved Regret Bounds for Gaussian Process Upper Confidence Bound in Bayesian Optimization
This paper addresses the Bayesian optimization problem (also referred to as the Bayesian setting of the Gaussian process bandit), where the learner seeks to minimize the regret under a function drawn from a known Gaussian process (GP). Under a Matérn kernel with a certain degree of smoothness, we show that the Gaussian process upper confidence bound (GP-UCB) algorithm achieves $\tilde{O}(\sqrt{T})$ cumulative regret with high probability. Furthermore, our analysis yields $O(\sqrt{T \ln^2 T})$ regret under a squared exponential kernel. These results fill the gap between the existing regret upper bound for GP-UCB and the best-known bound provided by Scarlett (2018). The key idea in our proof is to capture the concentration behavior of the input sequence realized by GP-UCB, enabling a more refined analysis of the GP's information gain.
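For reference, the GP-UCB rule analyzed above is simple to state: at each round, query the point maximizing the posterior mean plus a scaled posterior standard deviation. Below is a minimal sketch on a finite 1-D candidate grid with a squared exponential kernel; the confidence scaling `beta`, the length scale, and the test objective are all hand-picked illustrations, not the paper's choices (the theory takes `f` as an actual GP sample and a theoretically justified `beta`).

```python
import numpy as np

rng = np.random.default_rng(1)

def se_kernel(a, b, ls=0.2):
    """Squared exponential kernel on 1-D inputs (k(x, x) = 1)."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) ** 2) / ls**2)

X = np.linspace(0.0, 1.0, 200)        # finite candidate grid
f = np.sin(6 * X) + 0.5 * X           # unknown objective (stands in for a GP sample)
noise, beta = 0.1, 2.0                # observation noise std; UCB scale (assumed)

picked, ys = [], []
for t in range(30):
    if picked:
        Xq = X[picked]
        K = se_kernel(Xq, Xq) + noise**2 * np.eye(len(picked))
        ks = se_kernel(X, Xq)
        Kinv = np.linalg.inv(K)
        mu = ks @ Kinv @ np.array(ys)                        # posterior mean
        var = 1.0 - np.einsum('ij,jk,ik->i', ks, Kinv, ks)   # posterior variance
        var = np.maximum(var, 1e-12)
    else:
        mu, var = np.zeros_like(X), np.ones_like(X)
    i = int(np.argmax(mu + np.sqrt(beta * var)))             # GP-UCB acquisition
    picked.append(i)
    ys.append(f[i] + noise * rng.normal())

simple_regret = f.max() - f[picked].max()
print(simple_regret)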
Group Distributionally Robust Optimization with Flexible Sample Queries
Haomin Bai, Dingzhi Yu, Shuai Li, Haipeng Luo, Lijun Zhang
Group distributionally robust optimization (GDRO) aims to develop models that perform well across $m$ distributions simultaneously. Existing GDRO algorithms can only process a fixed number of samples per iteration, either 1 or $m$, and therefore cannot support scenarios where the sample size varies dynamically. To address this limitation, we investigate GDRO with flexible sample queries and cast it as a two-player game: one player solves an online convex optimization problem, while the other tackles a prediction with limited advice (PLA) problem. Within such a game, we propose a novel PLA algorithm, constructing appropriate loss estimators both for the case where the sample size is 1 and for the case where it is larger, and updating the decision using follow-the-regularized-leader. Then, we establish the first high-probability regret bound for non-oblivious PLA. Building upon the above approach, we develop a GDRO algorithm that allows an arbitrary and varying sample size per round, achieving a high-probability optimization error bound of $O\left(\frac{1}{t}\sqrt{\sum_{j=1}^t \frac{m}{r_j}\log m}\right)$, where $r_t$ denotes the sample size at round $t$. This result demonstrates that the optimization error decreases as the number of samples increases and implies a consistent sample complexity of $O(m\log(m)/\epsilon^2)$ for any fixed sample size $r\in[m]$, aligning with existing bounds for the cases $r=1$ and $r=m$. We validate our approach on synthetic binary and real-world multi-class datasets.
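To make the two-player view concrete, here is a minimal sketch of the classic one-sample-per-round GDRO dynamic (the $r_t = 1$ baseline the paper generalizes, not the paper's flexible-sample PLA algorithm): the model player runs online gradient descent on the sampled group's loss, while the group player runs a multiplicative-weights update (FTRL with an entropic regularizer) on importance-weighted loss estimates. The synthetic data, step sizes, and exploration mix `gamma` are all assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, T = 3, 5, 2000
gamma = 0.1                       # exploration mix for stable importance weights

# Synthetic setup: group j draws (x, y) with x standard normal, y = W_true[j] @ x.
W_true = rng.normal(size=(m, d))

def worst_group_loss(w):
    # For standard normal x, E[(w @ x - W_j @ x)^2] = ||w - W_j||^2.
    return float(np.max(np.sum((w - W_true) ** 2, axis=1)))

w = np.zeros(d)                   # model player (online gradient descent)
q = np.ones(m) / m                # group player (distribution over groups)

for t in range(T):
    p = (1 - gamma) * q + gamma / m           # sampling distribution with exploration
    j = rng.choice(m, p=p)                    # query one sample from group j (r_t = 1)
    x = rng.normal(size=d)
    err = w @ x - W_true[j] @ x
    w -= 0.01 * 2 * err * x                   # OGD step on the sampled group's loss
    est = np.zeros(m)
    est[j] = err**2 / p[j]                    # importance-weighted loss estimate
    q *= np.exp(0.001 * est)                  # multiplicative-weights ascent on groups
    q /= q.sum()

print(worst_group_loss(np.zeros(d)), worst_group_loss(w))
```

Sampling $j \sim p$ and taking the sampled group's gradient gives an unbiased gradient of the mixture objective up to the exploration mix, while dividing the observed loss by $p_j$ gives the group player an unbiased estimate of every group's loss from a single query.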